Tutorial Exercise 3

Estimating False Discovery Rates in Strategy Development

Author

Barry Quinn

Published

March 18, 2025

Code
# Load required libraries
library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(data.table)
library(gridExtra) # For arranging multiple plots

Introduction

In this tutorial, we will explore the problem of false discoveries in quantitative finance, specifically focusing on how multiple testing leads to selection bias in strategy development. We will:

  1. Simulate the process of strategy development with random strategies
  2. Calculate the impact of selection bias on Sharpe ratios
  3. Implement the Deflated Sharpe Ratio (DSR) to correct for selection bias
  4. Analyze how false discovery rates change with different parameters

This exercise will provide hands-on experience with the concepts discussed in the lecture on backtest overfitting.

Why False Discovery Rates Matter in Finance

Financial strategy development differs fundamentally from many scientific disciplines:

  1. Limited data: Unlike fields with abundant data (e.g., genomics), finance has relatively few independent observations
  2. High noise-to-signal ratio: Market returns contain substantial random variation, making pattern detection challenging
  3. Non-stationarity: Financial markets evolve over time, invalidating previously discovered relationships
  4. Practical consequences: Implementing false strategies leads to real financial losses

When a researcher tests multiple strategies and selects only the best performer, they risk identifying patterns that exist purely by chance. Understanding false discovery rates is essential for distinguishing genuine market inefficiencies from statistical artifacts.

Part 1: Generating Random Strategies

Let’s first create a function to generate random strategy returns. These will simulate strategies with no genuine edge (i.e., the true Sharpe ratio is zero), as well as some strategies with a small genuine edge.

Code
# Function to generate random strategy returns
# Most strategies have no real edge (true Sharpe ratio of zero);
# a chosen fraction (edge_pct) receives a small positive drift as genuine edge
generate_random_strategies <- function(n_strategies = 1000, 
                                       n_returns = 252, 
                                       mean_return = 0, 
                                       sd_return = 0.01,
                                       edge_pct = 0.05,    # Fraction of strategies with edge
                                       edge_size = 0.0005) # Size of daily edge (~12.6% annual)
{
  # Input validation
  if(!is.numeric(n_strategies) || n_strategies <= 0) 
    stop("n_strategies must be a positive number")
  if(!is.numeric(n_returns) || n_returns <= 0) 
    stop("n_returns must be a positive number")
  if(!is.numeric(edge_pct) || edge_pct < 0 || edge_pct > 1) 
    stop("edge_pct must be between 0 and 1")
  
  # Create a matrix of random returns (normally distributed)
  returns_matrix <- matrix(
    rnorm(n_strategies * n_returns, mean = mean_return, sd = sd_return),
    nrow = n_returns,
    ncol = n_strategies
  )
  
  # Add edge to a subset of strategies
  n_edge_strategies <- round(n_strategies * edge_pct)
  if (n_edge_strategies > 0) {
    # Add a small positive drift to create genuine edge
    for (i in 1:n_edge_strategies) {
      returns_matrix[, i] <- returns_matrix[, i] + edge_size
    }
  }
  
  # Name each strategy for easier reference
  colnames(returns_matrix) <- paste0("Strategy_", 1:n_strategies)
  
  # Create a vector indicating which strategies have true edge
  # seq_len() handles n_edge_strategies = 0 correctly (1:0 would index element 1)
  has_edge <- logical(n_strategies)
  has_edge[seq_len(n_edge_strategies)] <- TRUE
  
  return(list(
    returns = returns_matrix,
    has_edge = has_edge
  ))
}

# Generate strategies with some having genuine edge
set.seed(42)
strategy_data <- generate_random_strategies(
  n_strategies = 1000,
  n_returns = 252,
  edge_pct = 0.05,    # 5% of strategies have genuine edge
  edge_size = 0.0005  # Small daily drift (about 12% annual return)
)

# Extract the returns matrix and edge information
random_strategies <- strategy_data$returns
true_edge <- strategy_data$has_edge

# Check the dimensions to confirm we have 252 days × 1000 strategies
dim(random_strategies)
[1]  252 1000

Now, let’s calculate the Sharpe ratio for each of these strategies.

Code
# Function to calculate annualized Sharpe ratio
# The Sharpe ratio is the most common performance metric in finance
# It measures excess return per unit of risk (volatility)
calculate_sharpe <- function(returns, annualization_factor = 252) {
  # Calculate the mean daily return (this is our "signal")
  mean_return <- mean(returns)
  
  # Calculate the standard deviation of daily returns (this is our "noise")
  sd_return <- sd(returns)
  
  # Avoid division by zero (if we get a constant return series)
  if (sd_return == 0) return(0)
  
  # Calculate Sharpe ratio = signal/noise, then annualize
  # Annualization scales the daily Sharpe by sqrt(252) to get annual equivalent
  sharpe <- mean_return / sd_return * sqrt(annualization_factor)
  
  return(sharpe)
}

# Apply the Sharpe ratio calculation to each strategy (column)
# The "apply" function iterates through columns (2) of the matrix
sharpe_ratios <- apply(random_strategies, 2, calculate_sharpe)

# Examine the distribution statistics of these Sharpe ratios
# Even with random data, some strategies will appear to perform well!
summary(sharpe_ratios)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.12341 -0.63281 -0.00652  0.02047  0.69419  3.22375 
Code
# Separate Sharpe ratios for strategies with and without edge
edge_sharpes <- sharpe_ratios[true_edge]
no_edge_sharpes <- sharpe_ratios[!true_edge]

# Create a data frame for plotting
sharpe_df <- data.frame(
  Sharpe = sharpe_ratios,
  Has_Edge = factor(true_edge, levels = c(FALSE, TRUE),
                   labels = c("No Edge", "True Edge"))
)

# Visualize the distribution of Sharpe ratios by group
ggplot(sharpe_df, aes(x = Sharpe, fill = Has_Edge)) +
  geom_histogram(bins = 30, alpha = 0.7, position = "identity") +
  labs(
    title = "Distribution of Sharpe Ratios for Random Strategies",
    subtitle = paste("Mean =", round(mean(sharpe_ratios), 2), 
                    "SD =", round(sd(sharpe_ratios), 2),
                    "| 5% of strategies have true edge"),
    x = "Sharpe Ratio",
    y = "Count",
    fill = "Strategy Type"
  ) +
  theme_minimal() +
  # Add a vertical line at zero - the theoretical expectation for random strategies
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  scale_fill_manual(values = c("No Edge" = "steelblue", "True Edge" = "darkred"))

The histogram above shows the distribution of Sharpe ratios for our 1,000 randomly generated strategies. This is a critical visualization that illustrates several important concepts in quantitative finance:

Key Observations:

  1. Normal Distribution: Notice how the Sharpe ratios for strategies without edge follow an approximately normal distribution, centered near zero. This is exactly what we would expect when strategies have no genuine edge – their performance is purely random.

  2. Standard Deviation: The standard deviation of Sharpe ratios tells us about the spread of performance metrics. This is crucial because it determines how impressive a “good” Sharpe ratio needs to be to stand out from random noise.

  3. Range of Values: Even though most strategies are completely random (with no edge), we see Sharpe ratios ranging from approximately -3 to +3. In practical terms, a Sharpe ratio of +2 is typically considered excellent in the investment industry! Yet here we see several random strategies achieving this level purely by chance.

  4. Strategies with Edge: The strategies with genuine edge (in dark red) tend to have higher Sharpe ratios on average, but there’s substantial overlap with the no-edge strategies. Their true annualized Sharpe ratio is only about 0.0005 / 0.01 × √252 ≈ 0.8, which is less than one standard deviation of the sampling noise. This illustrates why it’s so difficult to distinguish genuine strategies from random ones based on Sharpe ratio alone.
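
The spread described in observation 2 can be anticipated before running any simulation. As a rough sketch, assuming i.i.d. normal daily returns and a true Sharpe ratio of zero, the standard error of an annualized Sharpe ratio estimated from T daily observations is approximately sqrt(252 / T), which equals 1 for one year of data. The names below (`n_days`, `n_sims`, `simulated_sr`) are illustrative and not part of the tutorial's own code.

```r
# Sketch: sampling spread of the annualized Sharpe ratio under the null
# Assumes i.i.d. normal daily returns with zero mean (no edge)
set.seed(1)
n_days <- 252    # one year of daily returns (illustrative)
n_sims <- 2000   # number of skill-free strategies to simulate

# Approximate standard error of the annualized Sharpe ratio when true SR = 0
analytic_se <- sqrt(252 / n_days)

# Simulate skill-free strategies and compute each annualized Sharpe ratio
simulated_sr <- replicate(n_sims, {
  r <- rnorm(n_days, mean = 0, sd = 0.01)
  mean(r) / sd(r) * sqrt(252)
})

cat("Analytic SE: ", analytic_se, "\n")
cat("Simulated SD:", round(sd(simulated_sr), 2), "\n")
```

Because the standard error shrinks only with the square root of the sample length, quadrupling the history to four years would still leave a spread of about 0.5 Sharpe units.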

Implications for Strategy Development:

This distribution demonstrates why statistical significance is so important in strategy evaluation. If you tested just one strategy and it showed a Sharpe ratio of 1.5, you might be excited about its performance. However, this chart shows that among 1,000 random strategies, we’d expect dozens (roughly 7%, if the no-edge Sharpe ratios are approximately standard normal) to show Sharpe ratios of 1.5 or higher simply due to chance.

This is precisely why multiple testing is so dangerous in quantitative finance. When researchers or traders test many strategy variations and report only the best performer, they are essentially “selecting” from the right tail of this distribution – capturing lucky outcomes rather than genuine edge.
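
To make the size of that right tail concrete, the following sketch counts how many skill-free strategies we would expect above a given Sharpe threshold. It assumes the annualized Sharpe ratios of no-edge strategies are distributed Normal(0, 1), which is roughly what one year of daily data produces; the thresholds and the `lucky_counts` name are illustrative.

```r
# Sketch: how many of 1,000 skill-free strategies clear a Sharpe threshold by luck
# Assumes annualized Sharpe ratios ~ Normal(0, 1) when there is no edge
n_strategies <- 1000
thresholds <- c(1.0, 1.5, 2.0, 2.5, 3.0)

# Upper-tail probability of exceeding each threshold under the null
tail_prob <- pnorm(thresholds, mean = 0, sd = 1, lower.tail = FALSE)

lucky_counts <- data.frame(
  threshold = thresholds,
  tail_probability = round(tail_prob, 4),
  expected_count = round(n_strategies * tail_prob, 1)
)
print(lucky_counts)
```

Under these assumptions about 67 of the 1,000 strategies clear a Sharpe ratio of 1.5 and about 23 clear 2.0, without any of them having an edge.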

In the next sections, we’ll explore exactly how the maximum Sharpe ratio increases with the number of trials, and how we can use the Deflated Sharpe Ratio to correct for this selection bias.

Checkpoint Questions: Understanding Random Strategy Performance

Take a moment to answer these questions:

  1. What key observation can we make about the distribution of Sharpe ratios from our random strategies?
  2. Why did some random strategies show Sharpe ratios above 2.0 despite having no true edge?
  3. If you developed a strategy with a Sharpe ratio of 1.8, how would you assess whether it was likely to be genuine?
Answers
  1. The Sharpe ratios follow an approximately normal distribution centered near zero (mean ≈ 0.02), with a standard deviation close to 1 (about 0.99).
  2. With 1,000 random strategies, we expect to see some extremely high Sharpe ratios purely by chance - this is a statistical property of multiple testing.
  3. You should consider how many alternative strategies were tested, estimate the expected maximum Sharpe ratio under the null hypothesis of no edge, and calculate the Deflated Sharpe Ratio to assess the probability that your 1.8 Sharpe is not simply the result of selection bias.

Part 2: Exploring Selection Bias

Now, let’s simulate the process of strategy selection, where a researcher tests multiple strategies and selects the one with the highest Sharpe ratio.

Code
# Find the maximum Sharpe ratio from our 1000 strategies
# This simulates what happens when a researcher selects only the best-performing strategy
max_sharpe <- max(sharpe_ratios)
max_sharpe_index <- which.max(sharpe_ratios)

# Check if the best-performing strategy has true edge
has_true_edge <- true_edge[max_sharpe_index]

# Print the maximum Sharpe ratio and which strategy achieved it
# This is often what would be presented in a backtest report or research paper
cat("Maximum Sharpe Ratio:", round(max_sharpe, 2), 
    "\nStrategy Index:", max_sharpe_index, 
    "\nHas True Edge:", has_true_edge, "\n")
Maximum Sharpe Ratio: 3.22 
Strategy Index: 851 
Has True Edge: FALSE 
Code
# Define a function to calculate the expected maximum Sharpe ratio
# This implements the False Strategy Theorem from the lecture
# It predicts how high the max Sharpe should be just from random chance
expected_max_sharpe <- function(n_trials, mean_sr = 0, std_sr = 1) {
  # Euler-Mascheroni constant (mathematical constant appearing in the theorem)
  emc <- 0.57721566490153286060651209008240243104215933593992
  
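  # Expected maximum Sharpe ratio formula (in units of std_sr):
  # The more trials we run, the higher this value becomes, even with random data
  sr0 <- (1 - emc) * qnorm(1 - 1 / n_trials) + 
        emc * qnorm(1 - (n_trials * exp(1))^(-1))
  
  # With a single trial there is no selection: the approximation would return
  # -Inf (qnorm(0)), whereas the expected "maximum" is simply the mean
  sr0 <- ifelse(n_trials > 1, sr0, 0)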
  
  # Adjust by the mean and standard deviation of the Sharpe ratio distribution
  sr0 <- mean_sr + std_sr * sr0
  
  return(sr0)
}

# Calculate the expected maximum Sharpe ratio for 1000 trials
# We use the actual standard deviation of our Sharpe ratios for accuracy
exp_max_sharpe <- expected_max_sharpe(1000, mean_sr = 0, std_sr = sd(sharpe_ratios))

# Compare the theoretical expectation with our actual observed maximum
# They should be relatively close if our strategies are truly random
cat("Expected Maximum Sharpe Ratio:", round(exp_max_sharpe, 2), "\n")
Expected Maximum Sharpe Ratio: 3.23 
Code
cat("Actual Maximum Sharpe Ratio:", round(max_sharpe, 2), "\n")
Actual Maximum Sharpe Ratio: 3.22 

Let’s visualize how the maximum Sharpe ratio increases with the number of trials.

Code
# Calculate expected maximum Sharpe ratio for different numbers of trials
# This helps us visualize how the "best strategy" improves just by testing more variations
trial_counts <- c(1, 5, 10, 50, 100, 500, 1000)
exp_max_sharpes <- sapply(trial_counts, expected_max_sharpe, 
                          mean_sr = 0, std_sr = sd(sharpe_ratios))

# Function to simulate actual maximum Sharpe ratios for different trial counts
# This gives us an empirical check against the theoretical prediction
simulate_max_sharpe <- function(n_trials, n_simulations = 100, 
                               mean_sr = 0, std_sr = 1) {
  # We'll run multiple simulations to get a distribution of maximum Sharpe ratios
  max_sharpes <- numeric(n_simulations)
  
  # For each simulation, generate n_trials random Sharpe ratios and find the maximum
  for (i in 1:n_simulations) {
    # Generate random Sharpe ratios with specified mean and std
    sharpes <- rnorm(n_trials, mean = mean_sr, sd = std_sr)
    # Record the maximum value - this simulates selecting the best strategy
    max_sharpes[i] <- max(sharpes)
  }
  
  # Return statistics about the distribution of maximum Sharpe ratios
  return(list(
    mean = mean(max_sharpes),  # Average maximum across simulations
    sd = sd(max_sharpes),      # Standard deviation of the maximums
    values = max_sharpes       # All the individual maximum values
  ))
}

# Run the simulation for each trial count
# For each number of trials, we simulate 100 sets to get a good estimate
set.seed(123)  # For reproducibility
simulated_results <- lapply(trial_counts, simulate_max_sharpe, 
                           n_simulations = 100, 
                           mean_sr = 0, 
                           std_sr = sd(sharpe_ratios))

# Extract the mean maximum Sharpe ratio from each simulation set
sim_max_sharpes <- sapply(simulated_results, function(x) x$mean)

# Create a data frame for plotting both theoretical and simulated results
plot_data <- tibble(
  Trials = rep(trial_counts, 2),  # Each trial count appears twice (theoretical and simulated)
  `Sharpe Ratio` = c(exp_max_sharpes, sim_max_sharpes),  # Values from both methods
  Type = rep(c("Theoretical", "Simulated"), each = length(trial_counts))  # Identify the source
)

# Create the plot comparing theoretical vs. simulated maximum Sharpe ratios
# This is a key visualization demonstrating how selection bias increases with trials
ggplot(plot_data, aes(x = Trials, y = `Sharpe Ratio`, color = Type)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 3) +
  scale_x_log10() +  # Log scale makes the pattern clearer across different trial counts
  labs(
    title = "Maximum Sharpe Ratio vs. Number of Trials",
    subtitle = "Theoretical (False Strategy Theorem) vs. Simulated",
    x = "Number of Trials (log scale)",
    y = "Maximum Sharpe Ratio"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("Theoretical" = "darkblue", "Simulated" = "darkred"))

The results above demonstrate one of the most important concepts in quantitative finance: the expected maximum Sharpe ratio increases systematically with the number of trials, even when all strategies have zero true edge.

Looking at our data:

  1. With just a single random strategy (trials = 1), the expected maximum Sharpe ratio is close to zero.
  2. With 10 trials, we expect a maximum Sharpe around 1.6.
  3. With 100 trials, the expected maximum rises to about 2.5.
  4. With 1000 trials, we expect a maximum Sharpe ratio above 3.0!

This relationship is both theoretical (as predicted by the False Strategy Theorem) and empirical (as shown by our simulations). The close match between our theoretical and simulated lines supports the theorem's approximation.

The implications for research are profound:

  • A strategy with a Sharpe ratio of 2.0 might seem impressive in isolation
  • But if it was selected as the best out of 100+ configurations tested, it falls short of even the maximum we’d expect from random chance alone
  • This explains why so many strategies that look excellent in backtests fail in live trading
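
The same relationship can be read in reverse: how many independent trials does it take before a given Sharpe ratio becomes the expected outcome of pure luck? The sketch below re-implements the expected-maximum approximation in a self-contained helper (`expected_max_sr_sketch` and `trials_needed` are hypothetical names) and assumes Sharpe ratios across trials are Normal(0, 1).

```r
# Sketch: smallest number of independent trials whose expected maximum
# Sharpe ratio (under the null of no edge) reaches a target level
# Assumes Sharpe ratios ~ Normal(0, sd_sr^2) across trials
expected_max_sr_sketch <- function(n_trials, sd_sr = 1) {
  gamma_em <- 0.5772156649  # Euler-Mascheroni constant
  sd_sr * ((1 - gamma_em) * qnorm(1 - 1 / n_trials) +
             gamma_em * qnorm(1 - 1 / (n_trials * exp(1))))
}

trials_needed <- function(target_sr, sd_sr = 1, max_trials = 1e5) {
  n_grid <- 2:max_trials
  reached <- expected_max_sr_sketch(n_grid, sd_sr) >= target_sr
  if (!any(reached)) return(NA_integer_)
  n_grid[which(reached)[1]]  # first trial count that reaches the target
}

targets <- c(1.0, 1.5, 2.0, 2.5, 3.0)
needed <- sapply(targets, trials_needed)
print(data.frame(target_sharpe = targets, trials_needed = needed))
```

The required number of trials grows quickly with the target, yet it stays well within what an automated parameter sweep produces, which is why the number of configurations tried should always be reported alongside a backtest.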

This is why proper statistical adjustments like the Deflated Sharpe Ratio (which we’ll explore next) are essential when evaluating investment strategies that resulted from backtesting multiple configurations.

The Selection Bias Visual

Let’s visualize the selection bias problem more directly:

Code
# Generate a set of Sharpe ratios for 100 random strategies
set.seed(456)
random_sharpes <- rnorm(100, mean = 0, sd = 1)

# Create a data frame for visualization
selection_bias_df <- data.frame(
  Strategy = 1:100,
  Sharpe = random_sharpes
)

# Plot the Sharpe ratios with the maximum highlighted
ggplot(selection_bias_df, aes(x = Strategy, y = Sharpe)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray") +
  geom_segment(aes(xend = Strategy, yend = 0), color = "steelblue", alpha = 0.5) +
  geom_point(color = "steelblue", size = 2) +
  geom_point(data = selection_bias_df[which.max(random_sharpes),], 
             color = "red", size = 4) +
  annotate("text", x = which.max(random_sharpes), 
           y = max(random_sharpes) + 0.3, 
           label = paste("Max SR =", round(max(random_sharpes), 2)), 
           color = "red") +
  labs(
    title = "Selection Bias in Strategy Development",
    subtitle = "100 random strategies with no true edge",
    x = "Strategy Number",
    y = "Sharpe Ratio"
  ) +
  theme_minimal()

This visualization shows exactly what happens in strategy selection: we run many trials and pick the best one. The strategy with the highest Sharpe ratio (highlighted in red) looks impressive, but it’s purely a result of random chance. If we were to implement this strategy, we would likely be disappointed by its future performance.
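
The disappointment described above can be simulated directly. The sketch below assumes skill-free strategies with i.i.d. normal daily returns, selects the best in-sample performer among 100 candidates, and then measures the same strategy on an independent out-of-sample year. All names and parameter values are illustrative.

```r
# Sketch: in-sample selection versus out-of-sample reality
# Assumes skill-free strategies with i.i.d. normal daily returns
set.seed(2024)
n_days <- 252
n_strategies <- 100
n_repeats <- 100

annualized_sharpe <- function(r) mean(r) / sd(r) * sqrt(252)

results <- t(replicate(n_repeats, {
  # In-sample and out-of-sample returns are independent draws
  in_sample  <- matrix(rnorm(n_days * n_strategies, sd = 0.01), nrow = n_days)
  out_sample <- matrix(rnorm(n_days * n_strategies, sd = 0.01), nrow = n_days)

  # Pick the strategy with the best in-sample Sharpe ratio
  sr_in <- apply(in_sample, 2, annualized_sharpe)
  best <- which.max(sr_in)

  c(in_sample = sr_in[best],
    out_of_sample = annualized_sharpe(out_sample[, best]))
}))

cat("Mean in-sample Sharpe of the selected strategy:    ",
    round(mean(results[, "in_sample"]), 2), "\n")
cat("Mean out-of-sample Sharpe of the selected strategy:",
    round(mean(results[, "out_of_sample"]), 2), "\n")
```

Averaged over many repetitions, the selected strategy looks strong in-sample and reverts to roughly zero out-of-sample, because the selection step captured noise rather than skill.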

Part 3: Calculating the Deflated Sharpe Ratio

  • Recall from the lecture

  • The Deflated Sharpe Ratio computes the probability that the Sharpe Ratio (SR) is statistically significant.

\[\widehat{DSR} \equiv \widehat{PSR}(\widehat{SR_0})=Z \left[\frac{(\hat{SR}-E[\max_k(\widehat{SR_k})])\sqrt{T-1}}{\sqrt{1-\hat{\gamma_3}\widehat{SR}+\frac{\hat{\gamma_4}-1}{4}\widehat{SR}^2}}\right]\]

  • \(\widehat{DSR}\) can be interpreted as the probability of observing a Sharpe ratio greater than or equal to \(\widehat{SR}\) under the null hypothesis that the true Sharpe ratio is zero, while adjusting for skewness \(\gamma_3\), kurtosis \(\gamma_4\) (equal to 3 for normally distributed returns), sample length \(T\), and multiple testing.

  • Calculating the DSR requires an estimate of \(E[\max_k(\widehat{SR_k})]\), which in turn requires estimating \(K\) and \(V(\hat{SR})\). This is where financial machine learning can help.

  • Specifically, we use a clustering algorithm with an optimal number of clusters to estimate \(K\), the effective number of trials, and then calculate the variances.
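
As a worked illustration of the formula, the sketch below evaluates the DSR from summary statistics alone and shows how the same observed Sharpe ratio is judged as the number of trials grows. It is a minimal sketch under stated assumptions: a hypothetical strategy with an annualized Sharpe ratio of 2.0 over 252 days, normally distributed returns (skewness 0, kurtosis 3), and a cross-trial standard deviation of annualized Sharpe ratios equal to 1. The helper names `dsr_from_stats` and `expected_max_sr_sketch` are illustrative.

```r
# Sketch: evaluate the DSR formula from summary statistics
# All Sharpe ratios are per-period (daily), matching the sqrt(T - 1) scaling
dsr_from_stats <- function(sr_hat, sr_benchmark, n_obs, skew = 0, kurt = 3) {
  # kurt is raw kurtosis (3 for a normal distribution), as in the formula
  z <- (sr_hat - sr_benchmark) * sqrt(n_obs - 1) /
    sqrt(1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat^2)
  pnorm(z)
}

# Expected maximum Sharpe ratio across n_trials skill-free strategies
expected_max_sr_sketch <- function(n_trials, sd_sr) {
  gamma_em <- 0.5772156649  # Euler-Mascheroni constant
  sr0 <- sd_sr * ((1 - gamma_em) * qnorm(1 - 1 / n_trials) +
                    gamma_em * qnorm(1 - 1 / (n_trials * exp(1))))
  ifelse(n_trials > 1, sr0, 0)  # a single trial involves no selection
}

n_obs <- 252
sr_daily <- 2.0 / sqrt(252)      # hypothetical annualized Sharpe of 2.0
sd_sr_daily <- 1.0 / sqrt(252)   # assumed SD of Sharpe ratios across trials
n_trials <- c(1, 10, 100, 1000)

dsr_values <- dsr_from_stats(
  sr_hat = sr_daily,
  sr_benchmark = expected_max_sr_sketch(n_trials, sd_sr_daily),
  n_obs = n_obs
)
print(data.frame(n_trials = n_trials, dsr = round(dsr_values, 3)))
```

With identical returns, the DSR is above 0.95 when the strategy was the only one tested and falls well below 0.5 when it was the best of 1,000. Note that the observed Sharpe ratio and the benchmark must be expressed in the same units (here, daily).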

Now, let’s implement the Deflated Sharpe Ratio to correct for selection bias under multiple testing.

Code
# Function to calculate skewness and excess kurtosis
# These higher moments are important because financial returns are rarely normally distributed
calculate_moments <- function(returns) {
  # Standardize returns to make calculations easier
  # This converts returns to units of standard deviation (z-scores)
  z <- (returns - mean(returns)) / sd(returns)
  
  # Calculate skewness - the third standardized moment
  # Skewness measures asymmetry: positive values indicate a right tail (occasional large gains)
  # Negative values indicate a left tail (occasional large losses)
  skew <- mean(z^3)
  
  # Calculate excess kurtosis - the fourth standardized moment minus 3
  # Kurtosis measures "tailedness": high values indicate fat tails (more extreme events)
  # For a normal distribution, kurtosis = 3, so excess kurtosis = 0
  kurt <- mean(z^4) - 3
  
  return(list(skewness = skew, kurtosis = kurt))
}

# Function to calculate the Deflated Sharpe Ratio
calculate_dsr <- function(returns, n_trials, sr_mean = 0, sr_std = NULL) {
  # Input validation
  if(!is.numeric(returns) || length(returns) < 10) {
    stop("Returns must be a numeric vector with sufficient observations")
  }
  if(!is.numeric(n_trials) || n_trials <= 0) {
    stop("Number of trials must be a positive number")
  }
  
  # Number of observations
  n <- length(returns)
  
  # Check for constant returns (all values the same)
  if(sd(returns) == 0 || is.na(sd(returns))) {
    warning("Returns have zero standard deviation, returning default values")
    return(list(
      sharpe_ratio = 0,
      expected_max_sr = 0,
      dsr = 0
    ))
  }
  
  # Step 1: Calculate Sharpe ratio (non-annualized)
  sr <- mean(returns) / sd(returns)
  
  # Step 2: Calculate higher moments for more accurate distribution
  moments <- calculate_moments(returns)
  skew <- moments$skewness
  kurt <- moments$kurtosis
  
  # Step 3: Set SR standard deviation if not provided
  if (is.null(sr_std) || is.na(sr_std) || sr_std <= 0) {
    warning("Using default SR standard deviation of 1")
    sr_std <- 1  # Default value
  }
  
  # Step 4: Calculate expected maximum Sharpe ratio
  exp_max_sr <- expected_max_sharpe(n_trials, mean_sr = sr_mean, std_sr = sr_std)
  
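  # Step 5: Calculate Deflated Sharpe Ratio
  # The formula uses raw kurtosis (gamma_4); `kurt` is excess kurtosis,
  # so gamma_4 - 1 = kurt + 2
  numerator <- (sr - exp_max_sr) * sqrt(n - 1)
  denominator <- sqrt(1 - skew * sr + ((kurt + 2) / 4) * sr^2)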
  
  # Handle potential numerical instability
  if (is.na(denominator) || denominator <= 0) {
    warning("Denominator calculation issue, using default value")
    denominator <- 1  # Use a safe default
  }
  
  # Final DSR calculation (probability interpretation)
  dsr <- pnorm(numerator / denominator)
  
  # Return results with clear names
  return(list(
    sharpe_ratio = sr,
    expected_max_sr = exp_max_sr,
    dsr = dsr,
    skewness = skew,
    kurtosis = kurt,
    n_observations = n
  ))
}

# Visualize the DSR calculation process
library(DiagrammeR)

# Create flow diagram of DSR calculation process
grViz("
digraph DSR_calculation {
  # Node definitions
  node [shape = rectangle, style = filled, fillcolor = lightblue, fontname = Helvetica]
  
  A [label = 'Input: Strategy returns\\nand number of trials']
  B [label = 'Calculate Sharpe ratio\\n(SR = mean/std)']
  C [label = 'Calculate higher moments\\n(skewness, kurtosis)']
  D [label = 'Estimate expected maximum SR\\n(False Strategy Theorem)']
  E [label = 'Calculate DSR\\n(probability measure)']
  F [label = 'Interpret DSR result\\n(probability of true discovery)']
  
  # Edge definitions
  A -> B
  B -> C
  {B; C} -> D
  {B; C; D} -> E
  E -> F
  
  # Graph attributes
  graph [rankdir = TB, splines = true, nodesep = 0.8]
}
")
Code
# Calculate DSR for our maximum Sharpe ratio strategy
# This applies our DSR function to the strategy that looked best in the backtest
best_strategy_returns <- random_strategies[, max_sharpe_index]
dsr_result <- calculate_dsr(
  best_strategy_returns, 
  n_trials = 1000,  # We tested 1000 strategies
  # calculate_dsr() compares daily (non-annualized) Sharpe ratios, so the
  # variability of our annualized Sharpe ratios is rescaled to daily units
  sr_std = sd(sharpe_ratios) / sqrt(252)
)

# Print the results (Sharpe ratios are annualized for display)
# These metrics tell us whether our "best strategy" is likely genuine or just lucky
cat("Sharpe Ratio:", round(dsr_result$sharpe_ratio * sqrt(252), 2), "(annualized)\n")
Sharpe Ratio: 3.22 (annualized)
Code
cat("Expected Max Sharpe Ratio:", round(dsr_result$expected_max_sr * sqrt(252), 2), "(annualized)\n")
Expected Max Sharpe Ratio: 3.23 (annualized)
Code
cat("Deflated Sharpe Ratio:", round(dsr_result$dsr, 2), "\n")
Deflated Sharpe Ratio: 0.5 
Code
cat("Skewness:", round(dsr_result$skewness, 2), "\n")
Skewness: -0.16 
Code
cat("Excess Kurtosis:", round(dsr_result$kurtosis, 2), "\n")
Excess Kurtosis: -0.27 

Understanding the Deflated Sharpe Ratio Result

The Deflated Sharpe Ratio (DSR) we’ve calculated represents the probability that our strategy’s performance is not merely due to selection bias. Because the DSR formula works with the non-annualized Sharpe ratio, the observed Sharpe ratio and the expected maximum must be compared in the same (daily) units; we annualize both only for display. Let’s interpret our results:

  1. Sharpe Ratio: Our best strategy has an annualized Sharpe ratio of 3.22. In traditional finance, this would be considered excellent performance, well above the typical threshold of 1.0 for investment consideration.

  2. Expected Maximum Sharpe Ratio: However, given that we tested 1000 strategies, we would expect to find a maximum Sharpe ratio of approximately 3.23 purely by chance. This is a critical benchmark that any truly successful strategy must exceed.

  3. Deflated Sharpe Ratio: The DSR of about 0.50 means the observed Sharpe ratio is statistically indistinguishable from the best result we would expect among 1,000 skill-free strategies. There is no evidence of performance beyond selection bias.

To put this in perspective:

  • A DSR < 0.5 (50%) suggests the strategy is more likely to be a false discovery than a true one
  • A DSR < 0.05 (5%) indicates very strong evidence that the strategy is merely a product of selection bias
  • A DSR > 0.95 (95%) would provide strong evidence that the strategy has genuine merit

Our result of about 0.50 falls far short of the 0.95 threshold: despite the impressive Sharpe ratio of 3.22, the best strategy shows no evidence of genuine skill, and the simulation confirms that it has no true edge. This illustrates why traditional performance metrics like the Sharpe ratio can be deceiving when multiple strategies are tested.

Industry Decision Framework

In professional quantitative investment firms, DSR thresholds often guide strategy implementation decisions:

DSR Range | Interpretation                     | Typical Action
0.00-0.20 | Strong evidence of false discovery | Reject strategy
0.20-0.50 | Likely false discovery             | Reject or conduct extensive additional testing
0.50-0.80 | Uncertain                          | Additional testing with independent data required
0.80-0.95 | Potentially valid                  | Implement with caution, possibly at reduced scale
0.95-1.00 | Strong evidence of valid strategy  | Full implementation consideration
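
The decision bands can be applied programmatically. The sketch below maps DSR values onto the actions in the table above using base R's `cut()`; the function name `classify_dsr` is hypothetical, and the treatment of boundary values (a DSR exactly on a boundary falls in the higher band) is an assumption, since the table does not specify it.

```r
# Sketch: map DSR values onto the decision bands in the table above
# Band edges follow the table; a value on a boundary falls in the higher band
classify_dsr <- function(dsr) {
  stopifnot(is.numeric(dsr), all(dsr >= 0 & dsr <= 1))
  bands <- cut(
    dsr,
    breaks = c(0, 0.20, 0.50, 0.80, 0.95, 1),
    labels = c("Reject strategy",
               "Reject or conduct extensive additional testing",
               "Additional testing with independent data required",
               "Implement with caution, possibly at reduced scale",
               "Full implementation consideration"),
    right = FALSE,          # intervals are closed on the left: [0.20, 0.50)
    include.lowest = TRUE   # so that a DSR of exactly 1 is included
  )
  as.character(bands)
}

example_dsr <- c(0.03, 0.35, 0.50, 0.90, 0.97)
print(data.frame(dsr = example_dsr, action = classify_dsr(example_dsr)))
```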

This analysis demonstrates why many strategies that look promising in backtests fail when implemented: they were simply the lucky winners in a multiple testing scenario, not strategies with genuine predictive power.

In the next section, we’ll explore how different parameters affect the false discovery rate, which will further illustrate why traditional backtesting approaches are so problematic.

Part 4: False Discovery Rate Analysis

Let’s analyze how the false discovery rate changes with different parameters.

Code
# Function to calculate precision, recall, and False Discovery Rate (FDR)
# This implements the mathematical framework from Lopez de Prado discussed in the lecture
calculate_fdr <- function(ground_truth, alpha = 0.05, beta = 0.2) {
  # Input validation
  if(ground_truth < 0 || ground_truth > 1) 
    stop("ground_truth must be between 0 and 1")
  if(alpha < 0 || alpha > 1) 
    stop("alpha must be between 0 and 1")
  if(beta < 0 || beta > 1) 
    stop("beta must be between 0 and 1")
  
  # Convert ground truth probability to odds ratio (theta)
  # ground_truth = proportion of strategies that are truly profitable
  # theta = ratio of true strategies to false strategies
  theta <- ground_truth / (1 - ground_truth)
  
  # Calculate recall (true positive rate)
  # Recall = 1 - beta, where beta is the Type II error rate (false negative rate)
  # This represents the probability of detecting a true strategy
  recall <- 1 - beta          
  
  # Calculate numerator for precision calculation
  # b1 = recall * theta = number of true positives / number of false strategies
  b1 <- recall * theta
  
  # Calculate precision
  # Precision = true positives / all positives
  # This represents the probability that a positive test indicates a true strategy
  precision <- b1 / (b1 + alpha)
  
  # Calculate False Discovery Rate (FDR)
  # FDR = 1 - precision = false positives / all positives
  # This is the probability that a strategy that tests positive is actually false
  fdr <- 1 - precision
  
  # Return all relevant metrics in a tidy format
  return(tibble(
    ground_truth = ground_truth,  # Original input - prior probability of true strategy
    theta = theta,                # Odds ratio of true vs. false strategies
    alpha = alpha,                # Type I error rate (significance level)
    beta = beta,                  # Type II error rate (1 - power)
    recall = recall,              # True positive rate (power)
    precision = precision,        # Proportion of positives that are true
    fdr = fdr                     # Proportion of positives that are false
  ))
}

# Calculate FDR for different ground truth probabilities
# This shows how the FDR changes based on the prior probability of true strategies
ground_truths <- seq(0.01, 0.5, by = 0.01)  # Try values from 1% to 50%
fdr_results <- map_df(ground_truths, calculate_fdr)  # Apply function to each value

# Plot the results to visualize the relationship
# This is a key insight: even with standard statistical testing, FDR remains high
# when true strategies are rare (which is the case in finance)
ggplot(fdr_results, aes(x = ground_truth)) +
  geom_line(aes(y = precision, color = "Precision"), linewidth = 1) +  # linewidth replaces size (ggplot2 >= 3.4.0)
  geom_line(aes(y = fdr, color = "FDR"), linewidth = 1) +
  scale_color_manual(values = c("Precision" = "blue", "FDR" = "red")) +
  labs(
    title = "Precision and False Discovery Rate vs. Ground Truth Probability",
    subtitle = "Alpha = 0.05, Beta = 0.2",
    x = "Ground Truth Probability (Proportion of True Strategies)",
    y = "Rate",
    color = "Metric"
  ) +
  theme_minimal() +
  scale_x_continuous(labels = scales::percent) +  # Format x-axis as percentages
  scale_y_continuous(labels = scales::percent) +   # Format y-axis as percentages
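  # Add a vertical line at the point where precision = 0.5 (FDR = 0.5)
  # This occurs at odds theta = alpha / (1 - beta) = 0.0625,
  # i.e. ground truth = theta / (1 + theta) = 0.05 / 0.85 ≈ 0.0588
  geom_vline(xintercept = 0.05 / 0.85, linetype = "dashed", color = "black") +
  annotate("text", x = 0.09, y = 0.5, 
           label = "Precision = 50%\nwhen ground truth ≈ 5.9%", 
           hjust = 0)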

Intuitive Explanation of the Precision and FDR Graph

This graph illustrates a fundamental challenge in quantitative investment strategy evaluation: understanding the true reliability of seemingly promising backtest results.

What the Graph Shows

The graph plots two key metrics, Precision (blue line) and False Discovery Rate (red line), against the “Ground Truth Probability,” which represents the proportion of strategies that are genuinely profitable in the universe of all possible strategies you might test.

Key Insights

  1. The Critical Threshold: The dashed vertical line marks where precision equals 50%. With α = 0.05 and β = 0.2, this happens when the odds of a true strategy are θ = α / (1 − β) = 0.0625, which corresponds to a ground truth probability of about 5.9%. When fewer than roughly 5.9% of all possible strategies are genuinely profitable, you’re more likely to be looking at a false positive than a true discovery, even when using standard statistical significance levels (α = 0.05).

  2. Low Ground Truth Environment: The finance industry likely operates in the far left region of this graph, where true profitable strategies are rare (perhaps 1-5% of all possibilities). In this region, the False Discovery Rate (red line) is very high: roughly 54% at a 5% ground truth probability and 86% at 1%.

  3. Statistical Power Balance: The graph is calculated with α = 0.05 (significance level) and β = 0.2 (false negative rate), meaning your statistical test has 80% power. Despite these seemingly rigorous parameters, precision remains problematically low when true strategies are rare.
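
The critical threshold in insight 1 has a closed form. Precision equals 50% when true and false positives balance, that is when (1 − β)θ = α, so θ* = α / (1 − β); converting odds back to a probability gives p* = θ* / (1 + θ*) = α / (α + 1 − β). The sketch below computes both quantities; `break_even_prior` is an illustrative name.

```r
# Sketch: prior probability at which precision equals 50%
# Precision = (1 - beta) * theta / ((1 - beta) * theta + alpha), with
# theta = p / (1 - p) the odds of a true strategy
break_even_prior <- function(alpha = 0.05, beta = 0.2) {
  theta_star <- alpha / (1 - beta)   # odds at which true and false positives balance
  theta_star / (1 + theta_star)      # convert odds back to a probability
}

p_star <- break_even_prior(alpha = 0.05, beta = 0.2)
cat("Break-even odds (theta):     ", 0.05 / 0.8, "\n")
cat("Break-even prior probability:", round(p_star, 4), "\n")
```

For α = 0.05 and β = 0.2 the break-even odds are 0.0625, which corresponds to a prior probability of about 0.0588. The distinction matters because the plot's horizontal axis is a probability, not an odds ratio.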

Practical Implications

This relationship explains why so many strategies that look promising in backtests fail in real-world implementation. Even with conventional statistical safeguards:

  • If only 5% of strategies are truly profitable, approximately 54% of strategies that pass your statistical tests will still be false positives.
  • If only 1% of strategies are truly profitable (which may be realistic in efficient markets), about 86% of “discoveries” will be false.
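
These percentages are easiest to verify by counting. The sketch below takes a hypothetical universe of 1,000 tested strategies and tabulates the expected true and false positives for α = 0.05 and 80% power; the universe size and the function name `expected_discoveries` are illustrative.

```r
# Sketch: expected confusion counts for a universe of tested strategies
# The universe size and parameter values are illustrative
expected_discoveries <- function(n_tested, prior, alpha = 0.05, beta = 0.2) {
  n_true  <- n_tested * prior         # strategies with a genuine edge
  n_false <- n_tested - n_true        # strategies with no edge
  true_pos  <- (1 - beta) * n_true    # genuine strategies that pass the test
  false_pos <- alpha * n_false        # skill-free strategies that pass by luck
  data.frame(
    prior = prior,
    true_positives = true_pos,
    false_positives = false_pos,
    fdr = false_pos / (true_pos + false_pos)
  )
}

counts <- rbind(
  expected_discoveries(1000, prior = 0.05),
  expected_discoveries(1000, prior = 0.01)
)
print(counts)
```

With a 5% prior, 40 genuine strategies pass alongside 47.5 lucky ones, so more than half of the discoveries are false. With a 1% prior, only 8 genuine strategies pass against 49.5 false positives.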

The fundamental issue is that standard statistical significance (p-values) tells you the probability of seeing your results if there were no true effect, but doesn’t tell you the probability that your discovery is genuine. This probability critically depends on the prevalence of true effects in your search space.

This graph provides strong mathematical justification for the need for robust adjustments like the Deflated Sharpe Ratio, out-of-sample testing, and other methods to control for selection bias under multiple testing.
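The percentages discussed above can be reproduced with a few lines of arithmetic. The sketch below assumes the ground truth is expressed as a proportion theta of true strategies, so that precision = (1 − β)·theta / ((1 − β)·theta + α·(1 − theta)). The helper name fdr_from_prior is hypothetical and is not the tutorial's calculate_fdr, whose internals may differ.

```r
# Hypothetical helper (not the tutorial's calculate_fdr): precision and FDR
# from the proportion of true strategies (theta), alpha, and beta.
fdr_from_prior <- function(theta, alpha = 0.05, beta = 0.2) {
  true_pos  <- (1 - beta) * theta    # true strategies that pass the test
  false_pos <- alpha * (1 - theta)   # false strategies that pass the test
  precision <- true_pos / (true_pos + false_pos)
  data.frame(theta = theta, precision = precision, fdr = 1 - precision)
}

fdr_table <- fdr_from_prior(c(0.01, 0.05, 0.10, 0.20))
print(round(fdr_table, 3))

# Proportion of true strategies at which precision is exactly 50%
# (assumes alpha = 0.05 and beta = 0.2)
theta_50 <- 0.05 / (0.05 + 0.8)
print(round(theta_50, 4))
```

Under these assumptions the FDR is about 86% at a 1% ground truth and about 54% at 5%, and precision crosses 50% at a ground truth of roughly 5.9%.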

Now, let’s see how the FDR changes with different significance levels (alpha).

Code
# Calculate FDR for different alpha values (significance levels)
alphas <- seq(0.01, 0.1, by = 0.01)  # Test alpha values from 1% to 10%

# Create combinations of ground truth probabilities and alphas
fdr_by_alpha <- expand.grid(
  initial_ground_truth = c(0.01, 0.05, 0.1, 0.2),  # Different prior probabilities
  initial_alpha = alphas                           # Different significance levels
) %>%
  as_tibble() %>%
  rowwise() %>%
  # Calculate FDR for each combination
  mutate(
    result = list(calculate_fdr(initial_ground_truth, initial_alpha, beta = 0.2))
  ) %>%
  unnest(result) %>%
  # Use the values from the calculate_fdr result, dropping the duplicates
  select(-initial_ground_truth, -initial_alpha)

# Plot FDR vs alpha for different ground truth probabilities
ggplot(fdr_by_alpha, aes(x = alpha, y = fdr, color = factor(ground_truth))) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(
    title = "False Discovery Rate vs. Significance Level",
    subtitle = "For Different Ground Truth Probabilities",
    x = "Significance Level (Alpha)",
    y = "False Discovery Rate",
    color = "Ground Truth"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  scale_color_discrete(name = "Ground Truth", 
                      labels = scales::percent(unique(fdr_by_alpha$ground_truth)))

Understanding False Discovery Rates in Finance

The plots above reveal a profound challenge in quantitative finance: the false discovery rate is alarmingly high even with traditional statistical safeguards.

Key Insights from the Second Plot:
  1. Significance Level Impact: Making our statistical tests more stringent (lower α) does reduce the false discovery rate, but it cannot compensate for a low ground truth probability. At a 1% ground truth, tightening α from 0.05 to 0.01 lowers the FDR from about 86% to about 55%.

  2. Diminishing Returns: Even with a very strict significance level of α = 0.01, the false discovery rate remains above 50% when the ground truth probability is just 1%.

  3. The Multiple Testing Connection: This analysis reveals why multiple testing is so problematic in finance: each test increases the opportunity for false positives in an environment where true positives are rare.

These findings help explain why so many published financial strategies fail to replicate and why institutional investors are rightfully skeptical of backtested performance. They also underscore the importance of methods like the Deflated Sharpe Ratio, which explicitly account for multiple testing and the low prior probability of true strategies.
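The multiple testing connection can be made concrete with a short calculation. This is a minimal sketch assuming the tests are independent and that none of the strategies has a true edge; the function name prob_any_false_positive is illustrative only.

```r
# Probability of at least one false positive among n independent tests
# of strategies with no true edge, each run at significance level alpha.
prob_any_false_positive <- function(n_tests, alpha = 0.05) {
  1 - (1 - alpha)^n_tests
}

n_tests <- c(1, 10, 50, 100)
fwer <- prob_any_false_positive(n_tests)
print(data.frame(n_tests = n_tests, prob_any_false_positive = round(fwer, 3)))
```

Under these assumptions, ten tests at α = 0.05 already give about a 40% chance of at least one spurious "discovery", and fifty tests give more than 90%.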

Checkpoint Questions: False Discovery Rates

Take a moment to reflect on these questions:

  1. Why is the false discovery rate so high in finance compared to other fields?
  2. If you were examining a strategy with a statistically significant p-value of 0.01, what other information would you need to assess its true validity?
  3. How might you increase the prior probability (ground truth) of your strategies before backtesting?
Answers
  1. The false discovery rate is high in finance because the prior probability of true strategies (ground truth) is typically very low due to market efficiency. Even with strict statistical tests, if the base rate of true strategies is low, most “discoveries” will be false positives.

  2. You would need to know: (a) how many strategies were tested before finding this one (to assess selection bias), (b) the prior probability of true strategies in your domain, and (c) the power of your test (1-β). With this information, you could calculate the true probability that your strategy is a false discovery.

  3. To increase prior probability, you could: (a) develop strategies based on strong economic rationales rather than data mining, (b) focus on areas of the market with known inefficiencies, (c) incorporate non-public information (where legally permitted), and (d) apply domain expertise to filter strategy ideas before backtesting.

Part 5: Impact of Sample Size on DSR

Let’s investigate how the sample size affects the Deflated Sharpe Ratio.

Code
# Generate random strategies with different sample sizes
# This helps us understand how the amount of data affects our ability to detect true strategies
sample_sizes <- c(63, 126, 252, 504, 1008)  # Approx. 3 months to 4 years of daily data

# Function to calculate DSR for different sample sizes
calculate_dsr_by_sample <- function(sample_size, n_strategies = 100, n_trials = 1000,
                                    edge_pct = 0.05, edge_size = 0.0005) {  # daily drift, as in Part 6
  # Generate strategies
  strategies_data <- generate_random_strategies(
    n_strategies = n_strategies,
    n_returns = sample_size,
    edge_pct = edge_pct,
    edge_size = edge_size
  )
  
  strategies <- strategies_data$returns
  has_edge <- strategies_data$has_edge
  
  # Calculate Sharpe ratios
  sharpes <- apply(strategies, 2, calculate_sharpe, 
                  annualization_factor = 252)
  
  # Find max Sharpe and its index
  max_sharpe <- max(sharpes)
  max_idx <- which.max(sharpes)
  
  # Check if max strategy has true edge
  best_has_edge <- has_edge[max_idx]
  
  # Add error handling
  tryCatch({
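    # Calculate DSR for best strategy
    # calculate_dsr works in per-period units (its Sharpe ratio is annualized below),
    # so convert the spread of the annualized Sharpe ratios back to daily units
    dsr_result <- calculate_dsr(
      strategies[, max_idx],
      n_trials = n_trials,
      sr_std = sd(sharpes) / sqrt(252)
    )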
    
    return(tibble(
      sample_size = sample_size,
      sharpe_ratio = dsr_result$sharpe_ratio * sqrt(252),  # Annualized
      expected_max_sr = dsr_result$expected_max_sr,
      dsr = dsr_result$dsr,
      has_edge = best_has_edge
    ))
  }, error = function(e) {
    # Return NA values if there's an error
    return(tibble(
      sample_size = sample_size,
      sharpe_ratio = NA_real_,
      expected_max_sr = NA_real_,
      dsr = NA_real_,
      has_edge = best_has_edge
    ))
  })
}

# Run multiple simulations for each sample size to get a distribution
# This gives us statistical confidence in our results rather than relying on a single simulation
set.seed(456)  # For reproducibility
n_simulations <- 50  # 50 simulations per sample size
dsr_by_sample <- map_df(sample_sizes, function(sample_size) {
  # For each sample size, run n_simulations iterations
  map_df(1:n_simulations, function(i) {
    calculate_dsr_by_sample(sample_size)
  })
})

# Plot the results using boxplots to show the distribution of DSR values
# This visualization shows how sample size affects the reliability of the DSR
ggplot(dsr_by_sample, aes(x = factor(sample_size), y = dsr, fill = has_edge)) +
  geom_boxplot() +
  labs(
    title = "Distribution of Deflated Sharpe Ratio by Sample Size",
    subtitle = "100 Strategies, 1000 Trials, 50 Simulations per Sample Size",
    x = "Sample Size (Number of Returns)",
    y = "Deflated Sharpe Ratio",
    fill = "Strategy Has\nTrue Edge"
  ) +
  theme_minimal() +
  scale_fill_manual(values = c("FALSE" = "lightblue", "TRUE" = "darkred"))

The Critical Impact of Sample Size on Strategy Evaluation

The boxplot visualization above reveals several crucial insights about how sample size affects our ability to distinguish true strategies from false discoveries:

1. DSR Variability Decreases with Sample Size

With small sample sizes (e.g., 63 days, or approximately 3 months of data), the DSR values show high variability. This means that with limited data, our assessment of whether a strategy is a true discovery or just lucky can change dramatically from one sample to another. This instability makes it dangerous to rely on short backtests.

2. DSR Distribution Shifts with Sample Size

Notice how the median DSR value changes across different sample sizes. With very small samples, we often get misleadingly high DSR values because the estimation error in both the Sharpe ratio and its variance is large. As the sample size increases, the DSR distribution typically converges toward a more accurate assessment.

3. Strategies with True Edge Benefit from Larger Samples

Strategies with genuine edge (shown in dark red) tend to have higher DSR values as the sample size increases. This demonstrates that with enough data, the DSR becomes better at distinguishing true strategies from false discoveries. However, even with large samples, there’s still overlap between strategies with and without edge.

4. Implications for Backtest Length

This analysis provides a quantitative basis for the common industry practice of requiring multiple years of backtest data. We can now see precisely how shorter backtests increase the risk of false discoveries:

  • With 3 months of data (63 points), DSR assessments are highly unreliable
  • With 1 year of data (252 points), we start to get more stable assessments
  • With 4 years of data (1008 points), our DSR estimates become much more trustworthy

5. The Tradeoff with Market Stationarity

However, there’s an important tradeoff: while longer backtests provide more statistical confidence, they also span more market regimes and potential structural changes. A strategy that worked well 4 years ago might no longer be effective today due to changing market conditions or competitor activity.

This highlights the need for both statistical rigor (longer samples) and economic reasoning (consideration of changing market dynamics) when evaluating investment strategies.
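Part of the instability at short sample sizes comes from the sampling error of the Sharpe ratio itself. The sketch below uses the standard large-sample approximation for i.i.d. normal returns, in which the standard error of the per-period Sharpe ratio is sqrt((1 + SR²/2) / T). The function name sharpe_se_annual and the choice of a true Sharpe ratio of zero are assumptions made for illustration.

```r
# Approximate standard error of an annualized Sharpe ratio estimated from
# n_obs daily returns, assuming i.i.d. normal returns.
sharpe_se_annual <- function(n_obs, sr_daily = 0, periods_per_year = 252) {
  se_daily <- sqrt((1 + 0.5 * sr_daily^2) / n_obs)  # per-period standard error
  se_daily * sqrt(periods_per_year)                 # scale to annual units
}

n_obs_grid <- c(63, 126, 252, 504, 1008)
se_table <- data.frame(
  sample_size = n_obs_grid,
  se_annual_sharpe = sharpe_se_annual(n_obs_grid)
)
print(se_table)
```

With a true Sharpe ratio of zero, the standard error of the annualized estimate is 2.0 with 63 observations, 1.0 with 252, and 0.5 with 1008, so quadrupling the sample only halves the noise.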

Statistical Significance vs. Strategy Stability

Let’s examine the relationship between statistical significance (Sharpe ratio), DSR, and sample size:

Code
# Calculate average DSR for strategies with and without edge across sample sizes
avg_dsr_by_sample <- dsr_by_sample %>%
  group_by(sample_size, has_edge) %>%
  summarize(
    avg_dsr = mean(dsr, na.rm = TRUE),
    avg_sharpe = mean(sharpe_ratio, na.rm = TRUE),
    .groups = 'drop'
  )

# Plot the average DSR and Sharpe ratio by sample size
p1 <- ggplot(avg_dsr_by_sample, aes(x = factor(sample_size), y = avg_dsr, 
                               color = has_edge, group = has_edge)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 3) +
  labs(
    title = "Average DSR by Sample Size",
    x = "Sample Size (Number of Returns)",
    y = "Average DSR",
    color = "Strategy Has\nTrue Edge"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("FALSE" = "blue", "TRUE" = "red"))

p2 <- ggplot(avg_dsr_by_sample, aes(x = factor(sample_size), y = avg_sharpe, 
                               color = has_edge, group = has_edge)) +
  geom_line(linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(size = 3) +
  labs(
    title = "Average Sharpe Ratio by Sample Size",
    x = "Sample Size (Number of Returns)",
    y = "Average Sharpe Ratio",
    color = "Strategy Has\nTrue Edge"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("FALSE" = "blue", "TRUE" = "red"))

# Display plots side by side
gridExtra::grid.arrange(p1, p2, ncol = 2)

The plots above reveal an interesting relationship between traditional performance metrics (Sharpe ratio) and the DSR across different sample sizes:

  1. Sharpe Ratio vs. DSR: While the Sharpe ratio may remain high even with small samples, the DSR provides a more realistic assessment of strategy quality by accounting for selection bias.

  2. Strategies with True Edge: Notice how strategies with genuine edge (red) show an increasing DSR as the sample size grows, while strategies without edge (blue) show a more stable or decreasing DSR.

  3. Statistical Stability: Larger sample sizes result in more stable estimates of both metrics, but the DSR is particularly sensitive to sample size due to its incorporation of higher moments and selection bias.

This comparison demonstrates why the DSR is a superior metric for strategy evaluation, especially when dealing with multiple testing and limited data.
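To make the role of higher moments and selection bias explicit, here is a minimal sketch of the DSR calculation following Bailey and Lopez de Prado (2014). It is not the tutorial's calculate_dsr: the name dsr_sketch is hypothetical, all Sharpe ratios are per-period (not annualized), and sr_std = 1 / sqrt(252) is an assumed value for the spread of Sharpe ratios across trials.

```r
# Hypothetical helper, not the tutorial's calculate_dsr.
# All Sharpe ratios are per-period (not annualized); n_trials must exceed 1.
dsr_sketch <- function(returns, n_trials, sr_std) {
  n_obs <- length(returns)
  sr <- mean(returns) / sd(returns)

  # Approximate sample skewness and (non-excess) kurtosis of the returns
  z <- (returns - mean(returns)) / sd(returns)
  skew <- mean(z^3)
  kurt <- mean(z^4)

  # Expected maximum Sharpe ratio among n_trials strategies with no edge
  euler_gamma <- 0.5772156649
  expected_max_sr <- sr_std * (
    (1 - euler_gamma) * qnorm(1 - 1 / n_trials) +
      euler_gamma * qnorm(1 - 1 / (n_trials * exp(1)))
  )

  # Probability that the true Sharpe ratio exceeds that benchmark
  sr_se_scale <- sqrt(1 - skew * sr + (kurt - 1) / 4 * sr^2)
  dsr <- pnorm((sr - expected_max_sr) * sqrt(n_obs - 1) / sr_se_scale)

  list(sharpe_ratio = sr, expected_max_sr = expected_max_sr, dsr = dsr)
}

set.seed(1)
example_returns <- rnorm(252, mean = 0.0005, sd = 0.01)
few_trials  <- dsr_sketch(example_returns, n_trials = 10,   sr_std = 1 / sqrt(252))
many_trials <- dsr_sketch(example_returns, n_trials = 1000, sr_std = 1 / sqrt(252))
cat("DSR with 10 trials:  ", round(few_trials$dsr, 3), "\n")
cat("DSR with 1000 trials:", round(many_trials$dsr, 3), "\n")
```

The same return series earns a lower DSR when it is the winner of 1000 trials than when it is the winner of 10, because the benchmark it must beat rises with the number of trials.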

Part 6: Analyzing Performance of Strategies Based on DSR

Now, let’s simulate the performance of strategies selected based on different criteria to see how DSR predicts out-of-sample performance.

Code
# Function to simulate out-of-sample performance
# This provides a realistic test of whether DSR predicts future performance
simulate_oos_performance <- function(n_strategies = 100, 
                                     in_sample_size = 252,
                                     out_sample_size = 252,
                                     edge_pct = 0.05,
                                     edge_size = 0.0005) {
  # Generate in-sample returns (this represents our "backtest" period)
  # We'll use this data to select strategies and calculate DSR
  strategy_data_in <- generate_random_strategies(
    n_strategies = n_strategies,
    n_returns = in_sample_size,
    edge_pct = edge_pct,
    edge_size = edge_size
  )
  
  in_sample <- strategy_data_in$returns
  has_edge <- strategy_data_in$has_edge
  
  # Calculate in-sample Sharpe ratios (what we'd see during strategy development)
  # These are the performance metrics that would guide strategy selection
  in_sample_sharpes <- apply(in_sample, 2, calculate_sharpe)
  
  # Generate out-of-sample returns (this represents future performance)
  # These are completely new random data, simulating what happens post-implementation
  # But strategies with edge still have edge
  out_sample <- matrix(
    rnorm(n_strategies * out_sample_size, mean = 0, sd = 0.01),
    nrow = out_sample_size,
    ncol = n_strategies
  )
  
  # Add edge to the same strategies that had edge in-sample
  for (i in which(has_edge)) {
    out_sample[, i] <- out_sample[, i] + edge_size
  }
  
  # Calculate out-of-sample Sharpe ratios (the "true" performance we care about)
  # This is what would actually be realized when trading the strategy
  out_sample_sharpes <- apply(out_sample, 2, calculate_sharpe)
  
  # Calculate DSR for each strategy based on in-sample data
  # We use this to test whether DSR effectively predicts out-of-sample performance
  dsrs <- numeric(n_strategies)
  for (i in 1:n_strategies) {
    dsr_result <- calculate_dsr(
      in_sample[, i],
      n_trials = n_strategies,  # Number of trials equals number of strategies
      sr_std = sd(in_sample_sharpes)  # Use actual variation in Sharpe ratios
    )
    dsrs[i] <- dsr_result$dsr
  }
  
  # Combine results into a single dataset for analysis
  results <- tibble(
    strategy = 1:n_strategies,
    in_sample_sharpe = in_sample_sharpes,
    out_sample_sharpe = out_sample_sharpes,
    dsr = dsrs,
    has_edge = has_edge
  )
  
  return(results)
}

# Run simulation with 100 strategies
# This gives us enough data to analyze the relationship between DSR and future performance
set.seed(789)  # For reproducibility
performance_results <- simulate_oos_performance(n_strategies = 100)

# Analyze the relationship between in-sample Sharpe, DSR, and out-of-sample Sharpe
# This visualization reveals whether DSR provides useful information about future performance
performance_results %>%
  ggplot(aes(x = in_sample_sharpe, y = out_sample_sharpe, color = dsr, shape = has_edge)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_color_gradient(low = "red", high = "blue") +  # Color by DSR value
  labs(
    title = "Out-of-Sample vs. In-Sample Sharpe Ratio",
    subtitle = "Color indicates Deflated Sharpe Ratio, Shape indicates True Edge",
    x = "In-Sample Sharpe Ratio",
    y = "Out-of-Sample Sharpe Ratio",
    color = "DSR",
    shape = "Has True Edge"
  ) +
  theme_minimal() +
  # Add regression line to show overall relationship
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  # Add reference lines at zero
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_vline(xintercept = 0, linetype = "dotted") +
  # Add custom shape scale
  scale_shape_manual(values = c("FALSE" = 16, "TRUE" = 17))

Code
# Group strategies by DSR to analyze performance patterns
# This helps us understand whether different DSR ranges predict different outcomes
performance_results <- performance_results %>%
  mutate(dsr_group = cut(dsr, breaks = c(0, 0.2, 0.5, 0.8, 1),
                        labels = c("Very Low (0-0.2)", 
                                   "Low (0.2-0.5)", 
                                   "Medium (0.5-0.8)", 
                                   "High (0.8-1)")))

# Calculate average out-of-sample performance by DSR group
# This quantifies the relationship between DSR category and future returns
dsr_group_performance <- performance_results %>%
  group_by(dsr_group) %>%
  summarise(
    count = n(),  # Number of strategies in each group
    avg_in_sample = mean(in_sample_sharpe),  # Average backtest performance
    avg_out_sample = mean(out_sample_sharpe),  # Average realized performance
    median_out_sample = median(out_sample_sharpe),  # Median (more robust to outliers)
    positive_rate = mean(out_sample_sharpe > 0),  # Proportion with positive returns
    true_edge_rate = mean(has_edge)  # Proportion with true edge
  ) %>%
  arrange(dsr_group)  # Sort by DSR group for readability

# Display the results in a nicely formatted table
kable(dsr_group_performance, 
      caption = "Out-of-Sample Performance by DSR Group",
      digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Out-of-Sample Performance by DSR Group

dsr_group  count  avg_in_sample  avg_out_sample  median_out_sample  positive_rate  true_edge_rate
NA         100    -0.003         0.055           0.054              0.51           0.05
Code
# Visualize out-of-sample performance by DSR group using boxplots
# This shows the full distribution, not just averages
ggplot(performance_results, aes(x = dsr_group, y = out_sample_sharpe, fill = has_edge)) +
  geom_boxplot(alpha = 0.7) +
  labs(
    title = "Out-of-Sample Sharpe Ratio by DSR Group",
    x = "Deflated Sharpe Ratio Group",
    y = "Out-of-Sample Sharpe Ratio",
    fill = "Has True Edge"
  ) +
  theme_minimal() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Reference line at zero
  scale_fill_manual(values = c("FALSE" = "steelblue", "TRUE" = "darkred"))
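The NA group in the table above comes from how cut() treats the lower boundary of the first bin. A small sketch with made-up DSR values (not taken from the simulation) shows the behavior and one possible adjustment, include.lowest = TRUE:

```r
# Made-up DSR values for illustration only
dsr_values <- c(0, 0.1, 0.5, 0.9, 1)
dsr_breaks <- c(0, 0.2, 0.5, 0.8, 1)
dsr_labels <- c("Very Low (0-0.2)", "Low (0.2-0.5)",
                "Medium (0.5-0.8)", "High (0.8-1)")

# Default: intervals are open on the left, so a DSR of exactly 0 becomes NA
default_groups <- cut(dsr_values, breaks = dsr_breaks, labels = dsr_labels)

# include.lowest = TRUE closes the first interval so that 0 is kept
closed_groups <- cut(dsr_values, breaks = dsr_breaks, labels = dsr_labels,
                     include.lowest = TRUE)

print(data.frame(dsr = dsr_values, default = default_groups, closed = closed_groups))
```

A row of NA for every strategy therefore indicates that every DSR was exactly 0 (or missing), not that the grouping step failed.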

Assessing the Predictive Value of the Deflated Sharpe Ratio

The analysis above tests a critical question: Does the Deflated Sharpe Ratio actually help predict which strategies will perform well out-of-sample? Let’s interpret what we see:

Key Insights from the Scatter Plot:

  1. Weak In-Sample/Out-of-Sample Correlation: The overall correlation between in-sample and out-of-sample Sharpe ratios (shown by the dashed trend line) is weak. This is expected: 95 of the 100 strategies are pure noise, so their in-sample ranking carries no information about future results.

  2. DSR Patterns: The color scale encodes DSR. If DSR is informative, high-DSR points (blue) should hold their performance out-of-sample better than low-DSR points (red). Check the color legend before drawing conclusions: if all points share one color, DSR did not vary across strategies in this run.

  3. True Edge Identification: Strategies with true edge (triangles) have a higher expected out-of-sample Sharpe ratio than those without (circles). With only five such strategies and one year of data, however, sampling noise can easily hide the difference.

  4. Extreme Values: Strategies with the highest in-sample Sharpe ratios and no true edge tend to fall back toward zero out-of-sample, which is the signature of a false discovery.

Key Insights from the DSR Group Analysis:

The table and boxplot are meant to give more structured evidence of DSR’s predictive value, but in the run shown above they cannot. The table has a single row, with dsr_group equal to NA for all 100 strategies. Because cut() uses intervals that are open on the left, a DSR of exactly 0 falls outside every bin, so this output means that every strategy received a DSR of 0.

  1. No Gradient in This Run: With all strategies in one group, the table reports only pooled figures: an average out-of-sample Sharpe ratio of 0.055, a positive rate of 51%, and a true edge rate of 5%, which is simply the edge_pct used in the simulation.

  2. Likely Cause: The DSR benchmark scales with sr_std. Elsewhere in this tutorial the outputs of calculate_dsr are multiplied by sqrt(252) to annualize them, which suggests the function works in per-period units. If sr_std is supplied in annualized units, the benchmark is inflated by a factor of about 16 and every DSR is pushed to 0.

  3. What to Look For: Once DSR values spread across the bins, an informative DSR should show higher true_edge_rate and positive_rate values in the higher groups.

  4. Variability Within Groups: Even then, expect substantial variability within each DSR group. DSR does not perfectly predict future performance, and there is still substantial randomness in outcomes.

Practical Implications:

These findings suggest several practical lessons for strategy development:

  1. Use DSR as a Filter: Rather than selecting strategies solely based on backtested Sharpe ratios, using DSR as a preliminary filter can help eliminate likely false discoveries.

  2. Focus on High DSR Strategies: A DSR close to 1 (for example, above 0.8) indicates that the observed Sharpe ratio comfortably exceeds what selection among the trials alone would be expected to produce, making such strategies better candidates for implementation.

  3. Diversification Remains Important: The substantial variability within DSR groups highlights that even with good statistical tools, diversification across multiple strategies remains essential for risk management.

  4. Statistical vs. Economic Significance: Remember that this simulation uses both random strategies and those with a small edge. In real-world scenarios, combining DSR with sound economic reasoning about why a strategy should work would further improve selection quality.

This analysis demonstrates that properly accounting for selection bias through tools like the Deflated Sharpe Ratio can significantly improve the strategy selection process, leading to more reliable investment performance.
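The selection effect behind these lessons can be isolated in a few lines. The sketch below is a simplified variant of the simulation above in which no strategy has any edge (an assumption made here so that everything selected is a false discovery); the variable names are illustrative.

```r
set.seed(42)
n_strategies <- 100
n_obs <- 252

annual_sharpe <- function(x) mean(x) / sd(x) * sqrt(252)

# In-sample and out-of-sample returns are independent noise with zero mean
in_sample  <- matrix(rnorm(n_strategies * n_obs, sd = 0.01), nrow = n_obs)
out_sample <- matrix(rnorm(n_strategies * n_obs, sd = 0.01), nrow = n_obs)

is_sharpe  <- apply(in_sample, 2, annual_sharpe)
oos_sharpe <- apply(out_sample, 2, annual_sharpe)

# Pick the best backtest and see what it delivers afterwards
best <- which.max(is_sharpe)
rank_cor <- cor(is_sharpe, oos_sharpe, method = "spearman")

cat("Best in-sample Sharpe:        ", round(is_sharpe[best], 2), "\n")
cat("Same strategy, out-of-sample: ", round(oos_sharpe[best], 2), "\n")
cat("Rank correlation IS vs OOS:   ", round(rank_cor, 2), "\n")
```

Because the two periods are independent, the winner's in-sample Sharpe ratio says nothing about its out-of-sample result; any gap between the two numbers is pure selection bias.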

Practical Example: Evaluating a Technical Trading Strategy

Let’s evaluate a simple moving average crossover strategy to demonstrate the DSR calculation process in a realistic context.

Code
# Set seed for reproducibility
set.seed(123)

# Generate market data
n_days <- 1000
market_returns <- rnorm(n_days, mean = 0.0004, sd = 0.01)  # ~10% annual return, 16% vol
prices <- 100 * cumprod(1 + market_returns)

# Create strategy: Buy when short MA crosses above long MA, sell when it crosses below
short_window <- 20
long_window <- 50

# Calculate moving averages
short_ma <- zoo::rollapply(prices, short_window, mean, fill = NA)
long_ma <- zoo::rollapply(prices, long_window, mean, fill = NA)

# Generate signals: 1 for long, -1 for short, 0 for no position
signals <- rep(0, n_days)
for (i in (long_window+1):n_days) {
  # Check if we have valid values for comparison
  if (!is.na(short_ma[i]) && !is.na(short_ma[i-1]) && 
      !is.na(long_ma[i]) && !is.na(long_ma[i-1])) {
    
    if (short_ma[i] > long_ma[i] && short_ma[i-1] <= long_ma[i-1]) {
      signals[i] <- 1  # Buy signal
    } else if (short_ma[i] < long_ma[i] && short_ma[i-1] >= long_ma[i-1]) {
      signals[i] <- -1  # Sell signal
    } else {
      signals[i] <- signals[i-1]  # Maintain previous position
    }
  } else {
    signals[i] <- 0  # No signal when we don't have enough data
  }
}

# Convert signals to position vector (1 = long, -1 = short, 0 = neutral)
positions <- rep(0, n_days)
for (i in (long_window+1):n_days) {
  if (signals[i] == 1) positions[i] <- 1
  else if (signals[i] == -1) positions[i] <- -1
  else positions[i] <- positions[i-1]
}

# Calculate strategy returns
strategy_returns <- c(0, positions[-n_days] * market_returns[-1])

# Analyze basic performance
sharpe_standard <- mean(strategy_returns[(long_window+1):n_days]) / 
                    sd(strategy_returns[(long_window+1):n_days]) * sqrt(252)

# Now let's assume this strategy was selected from 50 different parameter combinations
# Calculate DSR
dsr_result <- calculate_dsr(
  strategy_returns[(long_window+1):n_days],
  n_trials = 50,  # 50 different MA combinations were tested
  sr_std = sd(strategy_returns[(long_window+1):n_days]) * sqrt(252)
)

# Display results
cat("Traditional Sharpe Ratio (annualized):", round(sharpe_standard, 2), "\n")
Traditional Sharpe Ratio (annualized): 0.12 
Code
cat("Expected Max Sharpe Ratio from 50 trials:", 
    round(dsr_result$expected_max_sr * sqrt(252), 2), "\n")
Expected Max Sharpe Ratio from 50 trials: 5.69 
Code
cat("Deflated Sharpe Ratio:", round(dsr_result$dsr, 4), "\n")
Deflated Sharpe Ratio: 0 
Code
# Visualize strategy performance
# Create data frame with cumulative returns
performance_data <- data.frame(
  Day = 1:n_days,
  Price = prices,
  ShortMA = short_ma,
  LongMA = long_ma,
  Position = positions,
  CumulativeReturn = cumprod(1 + strategy_returns) - 1,
  MarketCumulativeReturn = cumprod(1 + market_returns) - 1
)

# Plot price and moving averages
p1 <- ggplot(performance_data, aes(x = Day)) +
  geom_line(aes(y = Price), color = "black") +
  geom_line(aes(y = ShortMA), color = "blue", linetype = "solid") +
  geom_line(aes(y = LongMA), color = "red", linetype = "solid") +
  labs(
    title = "Price and Moving Averages",
    x = "Trading Day",
    y = "Price"
  ) +
  theme_minimal()

# Plot strategy cumulative returns
p2 <- ggplot(performance_data, aes(x = Day)) +
  geom_line(aes(y = CumulativeReturn, color = "Strategy")) +
  geom_line(aes(y = MarketCumulativeReturn, color = "Market")) +
  labs(
    title = "Cumulative Returns",
    subtitle = paste("Sharpe:", round(sharpe_standard, 2), 
                    "DSR:", round(dsr_result$dsr, 4)),
    x = "Trading Day",
    y = "Cumulative Return",
    color = "Series"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("Strategy" = "blue", "Market" = "black"))

# Display plots
gridExtra::grid.arrange(p1, p2, ncol = 1)

Interpretation of Results

The traditional Sharpe ratio of our moving average crossover strategy is a modest 0.12 (annualized), and the DSR leaves little doubt about how to read it. With a DSR that rounds to 0, there is essentially no probability that the strategy’s true Sharpe ratio exceeds the benchmark set by the best of 50 trials; the observed performance is fully explained by selection bias.

This illustrates why considering the multiple testing problem is essential in strategy development. If we tested 50 different moving average parameter combinations and selected the best one, the expected maximum Sharpe ratio under randomness is the hurdle a strategy must clear. That hurdle scales with sr_std, which should be the standard deviation of Sharpe ratios across the trials. Here the annualized volatility of the strategy’s returns was used as a stand-in, which produces the very high hurdle of 5.69. With the Sharpe ratios of all 50 trials in hand, their standard deviation would give a more realistic benchmark.
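One way to obtain the spread of Sharpe ratios across trials is to run the whole parameter grid and measure it directly. The sketch below is an illustration under several assumptions: the grid of 5 short and 10 long windows is hypothetical, the rule is simplified to "long when the short average is above the long average, short otherwise", and the moving averages are trailing (each value uses only prices up to that day). The helper names trailing_ma and crossover_sharpe are not part of the tutorial.

```r
set.seed(123)
n_days <- 1000
market_returns <- rnorm(n_days, mean = 0.0004, sd = 0.01)
prices <- 100 * cumprod(1 + market_returns)

# Trailing moving average: the value at day i uses prices up to and including day i
trailing_ma <- function(x, window) {
  as.numeric(stats::filter(x, rep(1 / window, window), sides = 1))
}

# Daily (non-annualized) Sharpe ratio of one simplified crossover rule
crossover_sharpe <- function(short_window, long_window) {
  short_ma <- trailing_ma(prices, short_window)
  long_ma  <- trailing_ma(prices, long_window)
  # +1 when the short average is above the long one, -1 below, 0 before both exist
  position <- ifelse(is.na(long_ma), 0, sign(short_ma - long_ma))
  # Yesterday's position earns today's return
  strat <- position[-n_days] * market_returns[-1]
  strat <- strat[long_window:(n_days - 1)]
  mean(strat) / sd(strat)
}

# Hypothetical grid: 5 short windows x 10 long windows = 50 trials
param_grid <- expand.grid(
  short_window = c(5, 10, 20, 30, 40),
  long_window  = seq(50, 230, by = 20)
)
trial_sharpes <- mapply(crossover_sharpe,
                        param_grid$short_window, param_grid$long_window)

sr_std_trials <- sd(trial_sharpes)  # spread of daily Sharpe ratios across trials
cat("Number of trials:", length(trial_sharpes), "\n")
cat("Std. dev. of daily Sharpe ratios across trials:", round(sr_std_trials, 4), "\n")
```

The resulting sr_std_trials is in daily units. It could be passed as sr_std together with n_trials = 50, assuming calculate_dsr expects per-period inputs as the sqrt(252) rescaling of its outputs suggests.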

Conclusion

In this tutorial, we explored the impact of selection bias under multiple testing in quantitative finance. We found that:

  1. When testing multiple strategies, the maximum Sharpe ratio increases with the number of trials, even when the true Sharpe ratio is zero.
  2. The Deflated Sharpe Ratio provides a way to correct for this selection bias by estimating the probability that a strategy’s performance is not due to chance.
  3. False discovery rates are closely related to the proportion of true strategies in the testing pool, with higher false discovery rates when true strategies are rare.
  4. Larger sample sizes generally lead to more reliable DSR estimates.
  5. The DSR is designed to favor strategies whose performance is more likely to persist out-of-sample, but it can only do so when its inputs, in particular sr_std and the number of trials, are specified in consistent units.

These findings highlight the importance of accounting for multiple testing when developing investment strategies. By using metrics like the Deflated Sharpe Ratio, researchers can reduce the risk of implementing false strategies and improve their overall investment performance.

Glossary of Key Terms

Backtest Overfitting: The process by which a model or strategy is excessively customized to historical data, capturing noise rather than signal, resulting in poor out-of-sample performance.

Deflated Sharpe Ratio (DSR): A statistical measure that adjusts the Sharpe ratio to account for selection bias under multiple testing, skewness, kurtosis, and sample length.

False Discovery Rate (FDR): The expected proportion of false positives among all discoveries (positive results).

False Strategy Theorem: A mathematical theorem showing that the expected maximum Sharpe ratio increases with the number of trials, even when all strategies have zero true edge.

Multiple Testing Problem: The statistical issue that arises when many hypotheses are tested simultaneously, increasing the likelihood of false positives.

Precision: In strategy evaluation, the probability that a strategy with a positive backtest is truly profitable.

Recall: In strategy evaluation, the probability that a truly profitable strategy will show a positive backtest.

Selection Bias under Multiple Testing (SBuMT): The statistical inflation that occurs when a researcher conducts multiple trials but reports only the best result.

Sharpe Ratio: A measure of risk-adjusted return, calculated as the ratio of excess returns to volatility.

Type I Error (False Positive): Incorrectly rejecting a true null hypothesis (e.g., identifying a strategy as profitable when it isn’t).

Type II Error (False Negative): Failing to reject a false null hypothesis (e.g., failing to identify a truly profitable strategy).

Exercises for Students

Basic Exercises

  1. Sharpe Ratio Distribution: Modify the generate_random_strategies function to create 500 random strategies with 126 days (6 months) of returns. Calculate and plot the distribution of Sharpe ratios. How does it compare to the original distribution with 252 days?

  2. Expected Maximum Sharpe: Use the expected_max_sharpe function to calculate the expected maximum Sharpe ratio for different numbers of trials: 10, 50, 100, 500. Create a line plot showing how the expected maximum Sharpe increases with more trials.

Intermediate Exercises

  3. Strategies with True Edge: Modify the generate_random_strategies function to include a parameter that specifies what percentage of strategies should have a true edge. For these strategies, add a small positive drift of 0.0005 per day (approximately 12.6% annually). Calculate the proportion of strategies with true edge that are identified as significant using:
    1. Traditional t-test at 5% significance
    2. DSR > 0.95 threshold
# Starter code for Exercise 3
generate_strategies_with_edge <- function(n_strategies = 1000, 
                                         n_returns = 252, 
                                         mean_return = 0, 
                                         sd_return = 0.01,
                                         edge_pct = 0.05,    # Percentage of strategies with edge
                                         edge_size = 0.0005) # Size of daily edge
{
  # Create a matrix of random returns
  returns_matrix <- matrix(
    rnorm(n_strategies * n_returns, mean = mean_return, sd = sd_return),
    nrow = n_returns,
    ncol = n_strategies
  )
  
  # Add edge to a subset of strategies (the first n_edge_strategies columns)
  n_edge_strategies <- round(n_strategies * edge_pct)
  edge_idx <- seq_len(n_edge_strategies)  # Empty when no strategy has edge
  # Add a small positive drift to create genuine edge
  for (i in edge_idx) {
    returns_matrix[, i] <- returns_matrix[, i] + edge_size
  }
  
  # Name each strategy for easier reference
  colnames(returns_matrix) <- paste0("Strategy_", 1:n_strategies)
  
  # Create a vector indicating which strategies have true edge
  # (seq_len avoids marking strategy 1 when n_edge_strategies is 0)
  has_edge <- logical(n_strategies)
  has_edge[edge_idx] <- TRUE
  
  return(list(
    returns = returns_matrix,
    has_edge = has_edge
  ))
}

# Complete the exercise by comparing traditional and DSR approaches

  4. Monte Carlo Simulation: Create a Monte Carlo simulation to investigate how different parameters affect the DSR’s ability to detect true strategies.

Specifically, examine:

  1. The impact of varying the true edge size (from very small to very large)
  2. The effect of varying the proportion of strategies with true edge
  3. The relationship between sample size and detection power
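
A minimal sketch of such a simulation follows, assuming a small grid over edge size and sample size with a handful of repetitions per cell. The helper simulate_detection, the grid values, and n_sims are illustrative choices, and the DSR is again computed with the Bailey and Lopez de Prado (2014) formula using the cross-sectional standard deviation of Sharpe ratios.

```r
set.seed(2025)

# Hypothetical helper: one simulated research run, returning detection rates
# among the true-edge strategies for the t-test and for DSR > 0.95
simulate_detection <- function(edge_size, edge_pct = 0.05, n_returns = 252,
                               n_strategies = 200, sd_return = 0.01) {
  n_edge <- max(1, round(n_strategies * edge_pct))
  returns <- matrix(rnorm(n_strategies * n_returns, 0, sd_return),
                    nrow = n_returns, ncol = n_strategies)
  edge_idx <- seq_len(n_edge)
  returns[, edge_idx] <- returns[, edge_idx] + edge_size

  # Daily Sharpe ratios and one-sided t-test at the 5% level
  sr <- colMeans(returns) / apply(returns, 2, sd)
  sig_t <- pt(sr * sqrt(n_returns), df = n_returns - 1,
              lower.tail = FALSE) < 0.05

  # Deflated Sharpe ratio with n_strategies trials
  z <- scale(returns)
  skew <- colMeans(z^3)
  kurt <- colMeans(z^4)
  euler <- 0.5772156649
  sr0 <- sd(sr) * ((1 - euler) * qnorm(1 - 1 / n_strategies) +
                     euler * qnorm(1 - 1 / (n_strategies * exp(1))))
  dsr <- pnorm((sr - sr0) * sqrt(n_returns - 1) /
                 sqrt(1 - skew * sr + (kurt - 1) / 4 * sr^2))

  c(t_power = mean(sig_t[edge_idx]), dsr_power = mean(dsr[edge_idx] > 0.95))
}

# Parameter grid: edge size crossed with sample size (1 and 5 years of days)
grid <- expand.grid(edge_size = c(0.0005, 0.001, 0.002),
                    n_returns = c(252, 1260))
n_sims <- 10  # repetitions per grid cell; increase for smoother estimates

# Average the detection rates over n_sims runs for each grid row
power <- t(mapply(
  function(e, n) rowMeans(replicate(n_sims, simulate_detection(e, n_returns = n))),
  grid$edge_size, grid$n_returns
))
results <- cbind(grid, power)
print(results)
```

Adding an edge_pct column to the grid and passing it through to simulate_detection would cover the second item in the list in the same way.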

Advanced Exercises

DSR with Cross-Validation: Implement a function that uses k-fold cross-validation to estimate the out-of-sample performance of a strategy.

Calculate both the traditional Sharpe ratio and the DSR for each fold, then compare the average results.
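
The fold mechanics might look like the sketch below, which assumes contiguous folds, simulated zero-edge strategies, and a "pick the best in-sample Sharpe ratio" selection rule. All of these are illustrative assumptions. It reports only the traditional Sharpe ratio per fold; a DSR calculation would slot in where the in-sample winner is evaluated. Contiguous folds without purging or an embargo can leak information when returns are serially dependent, so treat this as a simplified starting point.

```r
set.seed(7)

# Simulated zero-edge strategies (assumption: 50 strategies, 504 daily returns)
n_returns <- 504
n_strategies <- 50
returns <- matrix(rnorm(n_returns * n_strategies, 0, 0.01), nrow = n_returns)

# Annualised Sharpe ratio (assuming 252 trading days per year)
sharpe <- function(x) mean(x) / sd(x) * sqrt(252)

# Assign each observation to one of k contiguous folds
k <- 5
fold_id <- cut(seq_len(n_returns), breaks = k, labels = FALSE)

# For each fold: select the best strategy on the other folds,
# then evaluate that same strategy on the held-out fold
cv_results <- t(sapply(seq_len(k), function(f) {
  train <- returns[fold_id != f, , drop = FALSE]
  test <- returns[fold_id == f, , drop = FALSE]
  best <- unname(which.max(apply(train, 2, sharpe)))
  c(fold = f,
    best_strategy = best,
    in_sample_sr = sharpe(train[, best]),
    out_of_sample_sr = sharpe(test[, best]))
}))
print(cv_results)

# Average in-sample versus out-of-sample Sharpe ratio across folds
print(colMeans(cv_results[, c("in_sample_sr", "out_of_sample_sr")]))
```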

Portfolio Construction with DSR: Create a portfolio allocation scheme based on DSR values.

Test different allocation rules:

  a) Invest only in strategies with DSR > 0.8
  b) Weight investments proportionally to DSR values
  c) Equal weight across all strategies

Compare the out-of-sample performance of these approaches.
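
The three rules can be expressed as weight vectors, as in the sketch below. The DSR values are made up purely for illustration, and rule a) is interpreted here as equal weighting among the strategies that pass the threshold, which is an assumption rather than something the exercise specifies.

```r
# Hypothetical DSR values for four strategies (made up for illustration)
dsr <- c(A = 0.95, B = 0.85, C = 0.60, D = 0.30)

# Rule a) invest only where DSR > 0.8, equally weighted among those that pass
# (all weights are zero if no strategy passes)
passes <- dsr > 0.8
w_threshold <- if (any(passes)) passes / sum(passes) else passes * 0

# Rule b) weights proportional to DSR
w_proportional <- dsr / sum(dsr)

# Rule c) equal weight across all strategies
w_equal <- setNames(rep(1 / length(dsr), length(dsr)), names(dsr))

weights <- rbind(threshold = w_threshold,
                 proportional = w_proportional,
                 equal = w_equal)
print(round(weights, 3))
```

Multiplying an out-of-sample returns matrix by each row of weights (for example, `oos_returns %*% weights["proportional", ]`, where oos_returns is a hypothetical matrix with one column per strategy) gives the portfolio return series to compare.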

Industry Application: Apply the DSR framework to evaluate a real-world trading strategy using either:

  1. A well-known technical indicator (e.g., RSI, MACD)

  2. A factor model (e.g., value, momentum, quality)

Analyze how the strategy performs in different market regimes and whether the DSR provides a better assessment than traditional metrics.

Further Reading

Foundational Papers

Bailey, D. H., & Lopez de Prado, M. (2014). “The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality.” Journal of Portfolio Management, 40(5), 94-107.
Harvey, C. R., & Liu, Y. (2015). “Backtesting.” Journal of Portfolio Management, 42(1), 13-28.
Bailey, D. H., Borwein, J. M., Lopez de Prado, M., & Zhu, Q. J. (2017). “The probability of backtest overfitting.” Journal of Computational Finance, 20(4), 39-69.

Practical Implementation

Lopez de Prado, M. (2018). Advances in Financial Machine Learning. John Wiley & Sons. Chapters 10-12.
Arnott, R. D., Harvey, C. R., & Markowitz, H. (2019). “A backtesting protocol in the era of machine learning.” Journal of Financial Data Science, 1(1), 64-74.

Online Resources

Quantitative Finance Stack Exchange: DSR Implementation
Lopez de Prado’s GitHub Repository
Marcos Lopez de Prado - False Strategy Theorem (Video lecture)

Advanced Topics

Harvey, C. R., & Liu, Y. (2020). “False (and Missed) Discoveries in Financial Economics.” Journal of Finance, 75(5), 2503-2553.
Chen, A. Y., & Zimmermann, T. (2022). “Open source cross-sectional asset pricing.” Critical Finance Review, 11(2), 207-264.
Lopez de Prado, M. (2019). “A Data Science Solution to the Multiple-Testing Crisis in Financial Research.” Journal of Financial Data Science, 1(1), 99-110.